(54) A system for interactive organization and browsing of video



(57) A system for interactively organizing and browsing video
automatically processes video, creating a video table of contents
(VTOC), while providing easy-to-use interfaces for verification,
correction, and augmentation of the automatically extracted video
structure. Shot detection, shot grouping and VTOC generation are
automatically determined without making restrictive assumptions
about the structure or content of the video. A nonstationary time
series model of difference metrics is used for shot boundary
detection. Color and edge similarities are used for shot grouping.
Observations about the structure of a wide class of videos are used
for generating the table of contents. The use of automatic
processing in conjunction with input from the user provides a
meaningful video organization.
Description

FIELD OF THE INVENTION 

[0001] This invention relates to video organization and browsing,
and in particular, to a system for automatically organizing raw
video into a tree structure that represents the video's organized
contents, and allowing a user to manually verify, correct, and
augment the automatically generated tree structure.

BACKGROUND OF THE INVENTION 

[0002] Multimedia information systems include vast amounts of video,
audio, animation, and graphics information. In order to manage all
this information efficiently, it is necessary to organize the
information into a usable format. Most structured videos, such as
news and documentaries, include repeating shots of the same person
or the same setting, which often convey information about the
semantic structure of the video. In organizing video information, it
is advantageous if this semantic structure is captured in a form
which is meaningful to a user.

[0003] Prior attempts have been made in organizing video. Database
systems typically use attribute-based indexing that involves
manually segmenting video into meaningful semantic units. Multimedia
information is abstracted by reducing the scope for posing ad hoc
queries to the multimedia database. See P. England et al., I/Browse:
The Bellcore Video Library Toolkit, Storage and Retrieval for Still
Image and Video Databases IV, SPIE, 1996. Attribute-based indexing,
however, is extremely time consuming because a human operator
manually indexes the multimedia information.
[0004] Computer vision systems typically use an automatic,
integrated feature extraction/object recognition subsystem which
eliminates the manual video segmentation of attribute-based
indexing. See M.M. Yeung et al., Video Browsing using Clustering and
Scene Transitions on Compressed Sequences, Multimedia Computing and
Networking, SPIE vol. 2417, pp 399-413, 1995; H.J. Zhang et al.,
Automatic parsing of news video, International Conference on
Multimedia Computing and Systems, pp 45-54, 1994; and D. Swanberg et
al., Knowledge guided parsing in video databases, Storage and
Retrieval for Image and Video Databases, SPIE vol. 1908, pp 13-25,
1993. These automatic methods attempt to capture the semantic
structure of video; however, they are computationally expensive and
difficult, extremely domain specific, and create hierarchies or
indexes with only a small, fixed number of levels. For example, in
the article by Zhang et al., known templates of anchor person shots
are used to separate news stories. A shot in video refers to a
contiguous recording of one or more raw frames of video depicting a
continuous action in time and space. In the article by Swanberg et
al., news videos are segmented or parsed using a known scene
structure of news programs and models of anchor person shots. News
videos have also been segmented by using the presence of a channel
logo, the skin tones of the anchor person and the scene structure of
the news episode. See B. Gunsal et al., Video Indexing through
Integration of Syntactic and Semantic Features, IEEE Multimedia
Systems, pp 90-95, 1996. Content-based indexing at the shot level
using motion (without developing a high-level description of the
video) has been described by R. Arman et al., Content-based browsing
of video sequences, ACM Multimedia, pp 97-103, August, 1994.

[0005] Domain dependent approaches, however, cannot be used to
capture the semantic structure in video for all possible scenarios,
even for a very simple domain such as the news. For example, not
every news story in a news broadcast begins with an anchor person
shot, and it is difficult to define an anchor person image model
that is generic to all broadcast stations.

[0006] A domain-independent approach that extracts story units for
video browsing applications has been described by M.M. Yeung et al.,
Time-constrained Clustering for Segmentation of Video into Story
Units, International Conference on Pattern Recognition, C, pp.
375-380, 1996. FIG. 1 shows a scene transition graph which provides
a compact representation that serves as a summary of the story and
may also provide useful information for automatic classification of
video types. The scene transition graph is generated by detecting
shots, identifying shots that have similar visual appearances, and
detecting story units. However, the graph reveals only limited
information about the semantic structure within a story unit. For
example, an entire news broadcast is classified as one single story,
making it difficult for users to browse through the news stories
individually.

[0007] Capturing the semantic structure in a video requires accurate
shot detection and shot grouping. Most existing shot detection
methods are based on preset thresholds or assumptions that reduce
their applicability to a limited range of video types. For example,
many existing methods make assumptions about how shots are connected
in videos, ignoring how films/videos are produced and edited in
reality. See P. Aigrain et al., The Automatic Real-Time Analysis of
Film Editing and Transition Effects and its Applications, Computer
and Graphics, Vol. 18, No. 1, pp. 93-103, 1994; A. Hampapur et al.,
Digital Video Segmentation, Proc. ACM Multimedia Conference, pp
357-383, 1994; and J. Meng et al., Scene Change Detection in a MPEG
Compressed Video Sequence, SPIE Vol. 2419, Digital Video Compression
Algorithms and Technologies, pp 14-25, 1995. These methods often
assume that both the incoming and outgoing shots are static scenes
with transitions which last for a period no longer than half a
second. These assumptions do not provide sufficient data for
modeling gradual shot transitions that are often present in
films/videos. Existing shot detection methods also assume that
time-series difference metrics are stationary, ignoring the fact
that such metrics are highly correlated time signals. It is also
assumed that the frame difference signal computed at each individual
pixel can be modeled by a stationary, independent, identically
distributed random variable which obeys a known probability
distribution, such as the Gaussian or Laplace. See H. Zhang et al.,
Automatic Parsing of Full-Motion Video, ACM Multimedia Systems, 1,
pp. 10-28, 1993. FIGS. 2A and 2B are histograms of typical
inter-frame difference images that do not correspond to shot
changes. FIG. 2A shows a histogram as the camera moves slowly left.
FIG. 2B depicts a histogram as the camera moves quickly right. The
curve of FIG. 2A is shaped differently from the curve of FIG. 2B.
Neither a Gaussian nor a Laplace fits both of these curves well. A
Gamma function fits the curve of FIG. 2A well, but not the curve of
FIG. 2B.
[0008] Additionally, many videos are converted from films. Video and
films are played at different frame rates; thus, every other film
frame is made a little bit longer to convert it to video.
Consequently, the video frames are made up of two fields with
totally different (although consecutive) pictures in them. As a
result, the digitization produces duplicate video frames and almost
zero inter-frame differences at five frame intervals. A similar
problem occurs in animated videos such as cartoons, except that it
produces almost zero inter-frame differences as often as every other
frame.

[0009] Color histograms are typically used for grouping visually
similar shots as described in M.J. Swain et al., Indexing via Color
Histograms, Third International Conference on Computer Vision, pp.
390-393, 1990. However, a color histogram's ability to detect
similarities when illumination variations are present is
substantially affected by the color space used and color space
quantizing. Commonly used RGB and HSV color spaces are sensitive to
illumination factors in varying degrees, and uniform quantization
goes against the principles of human perception. See G. Wyszecki et
al., Color Science: Concepts and Methods, Quantitative Data and
Formulae, John Wiley & Sons, Inc., 1982.
[0010] Thus, in practice, it is difficult to obtain a useful video
organization based solely on automatic processing.

[0011] Accordingly, there is a need for a system which makes
automatically extracted video structures more meaningful and useful.

SUMMARY OF THE INVENTION 

[0012] A system for interactively organizing and browsing raw video
to facilitate browsing of video archives includes automatic video
organizing means for automatically organizing raw video into a
hierarchical structure that depicts the video's organized contents.
A user interface means is provided for allowing a user to view and
manually edit the hierarchical structure to make the hierarchical
structure substantially useful and meaningful to the user.

[0013] One aspect involves a method for automatically grouping shots
into groups of visually similar shots, each group of shots capturing
structure in raw video, the shots generated by detecting abrupt
scene changes in raw frames of the video which represent a
continuous action in time and space. The method includes the steps
of providing a predetermined list of color names and describing
image colors in each of the shots using the predetermined list of
color names. The shots are clustered into visually similar groups
based on the image colors described in each of the shots, and image
edge information from each of the shots is used to identify and
remove incorrectly clustered shots from the groups.

[0014] Another aspect involves a method for automatically organizing
groups of visually similar shots, which capture structure in a
video, into a hierarchical structure that depicts the video's
organized contents. The hierarchical structure includes a root node
which represents the video in its entirety, main branches which
represent story units within the video, and secondary branches which
represent structure within each of the story units. The method
includes the steps of finding story units using the groups of shots,
each of the story units extending to a last re-occurrence of a shot
which occurs within the story unit, and creating a story node for
each of the story units, the story nodes defining the main branches
of the hierarchical structure. The structure within each of the
story units is found and the structure is attached as a child node
to the main branches, the child node defining the secondary branches
of the hierarchical structure.
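The story-unit step can be illustrated with a short sketch. It is one literal reading of "extending to a last re-occurrence of a shot which occurs within the story unit", not the patented procedure itself: the function name `find_story_units` and the use of one group label per shot are assumptions made for illustration.

```python
def find_story_units(labels):
    """Split a time-ordered list of shot group labels into story units.

    Assumption: each shot carries the label of its visually similar
    group. A unit starts at the first unassigned shot and keeps growing
    until no shot inside it re-occurs later in the video.
    """
    last_seen = {label: i for i, label in enumerate(labels)}
    units = []
    start = 0
    while start < len(labels):
        end = last_seen[labels[start]]
        i = start
        while i <= end:
            # A shot inside the unit that re-occurs later extends the unit.
            end = max(end, last_seen[labels[i]])
            i += 1
        units.append((start, end))  # inclusive shot indices
        start = end + 1
    return units


if __name__ == "__main__":
    print(find_story_units(["A", "B", "A", "C", "D", "C", "E"]))
```

Each returned pair would become one story node under the root; the structure inside a unit is left out of this sketch.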
[0015] Still another aspect involves a method for interactively
organizing and browsing video. The method includes the steps of
automatically organizing a raw video into a hierarchical structure
that depicts the video's organized contents and viewing and manually
editing the hierarchical structure to make the hierarchical
structure substantially useful and meaningful to the user.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] The advantages, nature and various additional features of the
invention will appear more fully upon consideration of the
illustrative embodiments now to be described in detail in connection
with the accompanying drawings. In the drawings:

FIG. 1 shows a scene transition graph in accordance with the prior
art;
FIGS. 2A and 2B depict histograms of typical inter-frame difference
images used in the prior art;
FIG. 3 is a block diagram illustrating the video organizing system
of the present invention;
FIG. 4 is a block diagram illustrating the interactions among the
graphic user interfaces;
FIG. 5A depicts a composite image displayed by the browser
interface;
FIGS. 5B and 5C demonstrate how a split ahead is processed by the
browser interface;
FIG. 6A shows a list of hue names used in a prior art NBS system;
FIG. 6B shows the hue modifiers used in the prior art NBS system;
FIGS. 7A and 7B are histograms obtained from two images of a soccer
match which demonstrate how reducing the number of colors in the
present invention increases the likelihood of two similar images
being clustered in the same group;
FIG. 8 shows images labeled with the 14 modified colors of the
present invention;
FIG. 9A shows the color histograms of anchor-person shots grouped
together based on their similar color distributions according to the
present invention;
FIG. 9B shows the color histograms of soccer field shots grouped
together based on their similar color distributions according to the
present invention;
FIGS. 10A-10C demonstrate quantization according to the present
invention;
FIG. 11 is a flow chart setting forth the steps performed by the
shot grouping method of the present invention;
FIG. 12 depicts a group structure displayed by the tree view
interface;
FIGS. 13A and 13B are flow charts setting forth the steps performed
by the method which generates the hierarchical tree structure;
FIG. 14 is a tree structure displayed by the tree view interface;
and
FIG. 15 depicts a video displayed by the video interface.

DETAILED DESCRIPTION OF THE INVENTION

[0017] Referring to FIG. 3, a block diagram illustrating the video
organizing system of the present invention is shown. The video
organizing system 10 is implemented on a computer (not shown) and
comprises three automatic video organizing methods 12 closely
integrated with three user interactive video organization interfaces
14 (graphic user interfaces). The automatic video organizing methods
12 include an automatic shot boundary (cut) detection method, an
automatic shot grouping method, and a method for automatically
generating a hierarchical "tree" structure representing a video
table of contents (VTOC). The graphic user interfaces 14 include a
browser interface, a tree structure viewer interface, and a video
player interface.
[0018] FIG. 4 is a block diagram illustrating the interactions among
the graphic user interfaces 14. The graphic user interfaces provide
the automatic video organizing methods with manual feedback during
the automatic creation of the tree structure. Mistakes made by the
automatic cut detection and/or shot grouping methods will not allow
the automatic tree structure method to produce a useful and
meaningful tree structure. The graphic user interfaces enable a user
to interact with each of the automatic video organizing methods to
verify, correct, and augment the results produced by each of them.
The graphic user interfaces communicate with each other so that
changes made using one interface produce the appropriate updates in
the other interfaces. It is very useful for a user to see how
changes made at one level propagate to the other levels, and to move
between levels. The three interfaces can be operated separately or
together in any combination. Any one of the three interfaces can
start the other interfaces. Accordingly, any one of them can be
started first, provided the required files are present, as will be
explained further on.
[0019] The video organizing system is based on the assumption that
similar repeating shots which alternate or interleave with other
shots are often used to convey parallel events in a scene or to
signal the beginning of a semantically meaningful unit. This is true
of a wide variety of structured videos such as news, sporting
events, interviews, documentaries, and the like. For example, news
and documentaries have an anchor-person appearing before each story
to introduce it. Interviews have an interviewer appearing to ask
each new question. Sporting events have sports action between
stadium shots or commentator shots. Accordingly, a tree structure
can be created directly from a list of similar identified repeating
shots. The tree structure preserves the time order among shots and
captures the syntactic structure of the video. The syntactic
structure is a hierarchical structure composed of stories, sub-plots
within the stories, and further sub-plots embedded within the
sub-plots. For most structured videos, the tree structure provides
interesting insights into the semantic context of a video which is
not possible using prior art scene transition graphs and the like.

[0020] In organizing raw video, the system first automatically
recovers the shots present in the video by automatically organizing
raw frames of the video into shots using the automatic cut detection
method. The cut detection method automatically detects scene changes
and provides good shot boundary detection in the presence of
outliers and difficult shot boundaries like fades and zooms. Each
shot is completely defined by a start frame and an end frame, and a
list of all shots in a video is stored in a shot-list. Each shot is
represented by a single frame, the representative frame, which is
stored as an image icon.

[0021] The automatic cut detection method implements a cut detection
method that combines two pixel-based difference metrics: inter-frame
difference metrics and distribution-based difference metrics. These
two difference metrics respond differently to different types of
shots and shot transitions. For example, inter-frame difference
metrics are very sensitive to camera moves, but are very good
indicators of shot changes. Distribution-based metrics are
relatively insensitive to camera and object motion, but produce
little response when two shots look quite different but have similar
distributions. These differences in sensitivity make it advantageous
to combine them for cut detection.
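The contrast between the two kinds of metric can be seen in a minimal sketch. The specific formulas below (mean absolute pixel difference and a coarse gray-level histogram difference) are assumptions chosen for illustration; the text does not state which formulas the method uses. Frames are assumed to be lists of rows of 8-bit gray values.

```python
def interframe_difference(a, b):
    """Mean absolute difference between co-located pixels (assumed metric)."""
    total = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def histogram(frame, bins=4, levels=256):
    """Normalized gray-level histogram with a small number of bins."""
    counts = [0] * bins
    for row in frame:
        for p in row:
            counts[p * bins // levels] += 1
    n = len(frame) * len(frame[0])
    return [c / n for c in counts]


def distribution_difference(a, b, bins=4):
    """Sum of absolute bin differences between two histograms (assumed metric)."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb))


if __name__ == "__main__":
    frame = [[0, 255], [0, 255]]
    moved = [[255, 0], [255, 0]]  # same content shifted by one pixel
    print(interframe_difference(frame, moved))    # large response to motion
    print(distribution_difference(frame, moved))  # no response to motion
```

The pixel-wise metric reacts strongly to the shifted frame while the histogram metric does not react at all, which is the complementary behavior that motivates combining them.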
[0022] The sequence of difference metrics is modeled as a
nonstationary time series signal. The sequence of difference
metrics, no matter how it is computed, is just like any economic or
statistical data collected over time. Thus, shot changes as well as
film-to-video conversion processes create observation outliers in
time series. In turn, gradual shot transitions and gradual camera
moves produce innovation outliers. An observation outlier is caused
by a gross error of observation or by a recording error, and only
affects a single observation. Similarly, an innovation outlier
corresponds to a single extreme "innovation", and affects both the
particular observation and subsequent observations. There are
standard methods for detecting both innovation and observation
outliers based on the estimate of time trend and autoregressive
coefficients. See A.J. Fox, Outliers in Time Series, Journal of the
Royal Statistical Society, Series B, 34, pp. 350-363, 1972; B.
Abraham et al., Outlier Detection and Time Series Modeling,
Technometrics, Vol. 31, No. 2, pp. 241-248, May 1982; and L.K. Hotta
et al., A Brief Review of Tests for Detection of Time Series
Outliers, Estadistica, 44, 142, 143, pp. 103-148, 1992. These
methods, however, cannot be applied to cut detection directly, for
the following three reasons. First, most methods require intensive
computation, using least squares or the like, to estimate time trend
and autoregressive coefficients. This amount of computation is
generally not desired. Second, the observation outliers created by
slow motion and the film-to-video conversion process could occur as
often as one in every other sample, making the time trend and
autoregressive coefficient estimation an extremely difficult
process. Finally, since gradual shot transitions and gradual camera
moves are indistinguishable in most cases, location of gradual shot
transitions requires not only the detection of innovation outliers
but also an extra camera motion estimation step.
[0023] Accordingly, the automatic cut detection method preferably
implements a method that uses a zeroth-order autoregressive model
and a piecewise-linear function to model the time trend. With this
simplification, samples from both the past and the future are used
in order to improve the robustness of time trend estimation. More
than half the samples may be discarded because the observation
outliers created by slow motion and film-to-video conversion
processes may occur as often as one in every other sample. However,
these types of observation outliers are least in value and are
easily identified. After the time trend is removed, the remaining
value is tested against a normal distribution N(0, s) in which s can
be estimated recursively or in advance.
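A rough sketch of this detrend-then-test idea follows. Several choices are assumptions and not taken from the text: the window size, discarding exactly the smaller half of each window, approximating the trend at a sample by the midpoint of the line joining the past level and the future level, and the threshold of `k` standard deviations. The names `kept` and `detect_cuts` are hypothetical.

```python
from statistics import mean


def kept(samples):
    """Discard the smaller half of a window.

    Assumption: near-zero differences from duplicated frames are the
    least in value, so dropping the low half removes them.
    """
    ordered = sorted(samples)
    return ordered[len(ordered) // 2:]


def detect_cuts(diffs, half_window=4, sigma=1.0, k=3.0):
    """Flag samples whose detrended value is unlikely under N(0, sigma).

    The trend at t is the midpoint between a past level and a future
    level, a crude piecewise-linear estimate using both sides of t.
    """
    cuts = []
    for t in range(half_window, len(diffs) - half_window):
        past = kept(diffs[t - half_window:t])
        future = kept(diffs[t + 1:t + 1 + half_window])
        trend = (mean(past) + mean(future)) / 2
        residual = diffs[t] - trend
        if residual > k * sigma:  # sigma assumed to be estimated in advance
            cuts.append(t)
    return cuts


if __name__ == "__main__":
    # Every other difference is near zero, as with duplicated frames.
    diffs = [2, 0, 2, 0, 2, 0, 2, 0, 30, 0, 2, 0, 2, 0, 2, 0, 2]
    print(detect_cuts(diffs))
```

No least-squares fit is needed here, which is in the spirit of the simplification described above, though the actual estimator may differ.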

[0024] To make the automatic cut detection more robust, a modified
Kolmogorov-Smirnov test for eliminating false positives is
preferably implemented by the cut detection method. This test is
selected because it does not assume a priori knowledge of the
underlying distribution function. A traditional Kolmogorov-Smirnov
test procedure compares the computed test metric with a preset
significance level (normally at 95%). Kolmogorov-Smirnov tests have
been used in the prior art to detect cuts from videos. See I.K.
Sethi et al., A Statistical Approach to Scene Change Detection, SPIE
Vol. 2420, Storage and Retrieval for Image and Video Databases III,
pp. 329-338, 1995. A single preselected significance level ignores
the non-stationary nature of the cut detection problem. Accordingly,
the modified Kolmogorov-Smirnov test accounts for the non-stationary
nature of the problem by automatically adjusting the significance
level to different types of video contents. For example, one way to
represent video content is to use measurements in the spatial and
the temporal domains together. Image contrast is a good spatial
domain measurement because the amount of intensity changes across
two neighboring frames measures video content in the temporal
domain. As image contrast increases, cut detection sensitivity
should be increased, and as changes occurring in two consecutive
images increase, cut detection sensitivity should be decreased.
[0025] The traditional Kolmogorov-Smirnov test also cannot
differentiate a long shot from a close up of the same scene. To
guard against such transitions, the modified Kolmogorov-Smirnov test
uses a hierarchical method where each frame is divided into four
rectangular regions of equal size and a traditional
Kolmogorov-Smirnov test is applied to every pair of regions as well
as to the entire image. The modified Kolmogorov-Smirnov test
produces five binary numbers that indicate whether there is a change
in the entire image as well as in each of the four subimages.
Instead of directly using these five binary numbers to eliminate
false positives, the test results are used qualitatively by
comparing the significance of a shot change frame against that of
its neighboring frames.
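The five binary results can be sketched as follows. This is a simplified illustration under stated assumptions: "every pair of regions" is read as each quadrant of one frame paired with the same quadrant of the other frame, a fixed significance level is used (the adaptive adjustment and the qualitative comparison with neighboring frames are not modeled), and the critical value is the usual large-sample two-sample approximation.

```python
import math
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


def ks_changed(xs, ys, alpha=0.05):
    """True when the statistic exceeds the asymptotic critical value."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c * math.sqrt((len(xs) + len(ys)) / (len(xs) * len(ys)))
    return ks_statistic(xs, ys) > critical


def regions(frame):
    """Pixel lists for the whole frame and its four equal quadrants."""
    h, w = len(frame), len(frame[0])
    hh, hw = h // 2, w // 2
    whole = [p for row in frame for p in row]
    quads = [
        [p for row in frame[r0:r1] for p in row[c0:c1]]
        for r0, r1 in ((0, hh), (hh, h))
        for c0, c1 in ((0, hw), (hw, w))
    ]
    return [whole] + quads  # whole, top-left, top-right, bottom-left, bottom-right


def hierarchical_ks(frame_a, frame_b, alpha=0.05):
    """Five booleans: change in the whole image and in each quadrant."""
    return [ks_changed(a, b, alpha)
            for a, b in zip(regions(frame_a), regions(frame_b))]


if __name__ == "__main__":
    left_right = [[0] * 4 + [100] * 4 for _ in range(8)]
    top_bottom = [[0] * 8 for _ in range(4)] + [[100] * 8 for _ in range(4)]
    print(hierarchical_ks(left_right, top_bottom))
```

In the usage example the two frames have identical overall intensity distributions, so the whole-image test reports no change while two of the quadrant tests do.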

[0026] Examples of cut detection methods which employ the techniques
described above are described in copending U.S. Patent Application
No. 08/576,271 entitled CUT BROWSING AND EDITING APPARATUS filed on
December 21, 1995 and copending U.S. Patent Application No.
08/576,272 entitled APPARATUS FOR DETECTING A CUT IN A VIDEO filed
on December 21, 1995. Both applications are incorporated herein by
reference. It should be noted that although the automatic cut
detection methods described above are preferred, other suitable
automatic cut detection methods and algorithms can also be used in
the system.

[0027] After the raw frames of the video are automatically organized
into shots, the browser interface enables a user to view the shots
in the shot-list. The browser interface displays the video to the
user in the form of a composite image which makes the shot
boundaries, automatically detected by the cut detection method,
visually easier to detect. The composite image is constructed by
including a horizontal and a vertical slice of a single pixel width
from the center line of each frame in the video along the time axis.
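A minimal sketch of the slice extraction is given below. How the two sets of slices are arranged on screen is not specified here, so the layout (one strip per slice direction, with the vertical slices turned so that time runs left to right) is an assumption, as are the function names.

```python
def center_slices(frame):
    """Return the center row and center column of one frame."""
    h, w = len(frame), len(frame[0])
    horizontal = list(frame[h // 2])
    vertical = [row[w // 2] for row in frame]
    return horizontal, vertical


def composite_images(frames):
    """Stack single-pixel center slices of every frame along the time axis.

    Returns two strips (assumed layout):
      - horizontal_strip: one row per frame, time runs top to bottom
      - vertical_strip: one column per frame, time runs left to right
    """
    horizontals, verticals = [], []
    for frame in frames:
        hs, vs = center_slices(frame)
        horizontals.append(hs)
        verticals.append(vs)
    vertical_strip = [list(col) for col in zip(*verticals)]
    return horizontals, vertical_strip


if __name__ == "__main__":
    # Three tiny 2x4 frames whose pixels all equal the frame index.
    frames = [[[k] * 4 for _ in range(2)] for k in range(3)]
    h_strip, v_strip = composite_images(frames)
    print(h_strip)
    print(v_strip)
```

A cut shows up in either strip as an abrupt discontinuity along the time direction, which is what makes shot boundaries easy to spot by eye.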
[0028] FIG. 5A depicts a composite image 16 displayed by the browser
interface. Shots are visually depicted by colored bars 18 in the
browser interface, allowing easy checking of shot boundaries.
[0029] Automatic shot boundary detection may produce unsatisfactory
results in the case of slow wipes from one shot to the next,
momentary changes in illumination (flashes), high activity in the
frames, zooms, and the like. Changes may also be necessary in the
automatically generated shot-list. For example, additional shots may
have to be added to the shot-list or two shots may have to be merged
into one. Any changes made in the shot boundaries will change the
shot-list and the set of icon images depicting the shots.
Accordingly, the browser interface also enables a user to edit the
shots by providing a number of operations which allow the shot
boundaries to be modified by the user. These operations are referred
to as split, split ahead, merge, and play video.

[0030] A split operation involves marking a point of split in a shot
and splitting the shot into two shots. Icon images representing the
two shots are produced and the internal shot structure of the
browser interface is updated. The split is used to detect shots
which were not detected automatically with the cut detection method.
[0031] A split ahead operation is used in gradual shot changes,
where one shot fades into another (in a transition region) making it
difficult for a user to locate the point where the shot should be
split to get a good quality representative icon from the new shot
created. In a split ahead, any point selected in the transition
region produces a correct split. The point where the transition is
completed is detected by processing the region following the point
selected by the user. FIGS. 5B and 5C demonstrate how a split ahead
is processed. FIG. 5B shows a middle icon image 20 with a frame
number (5138) selected by the user based on visual inspection of the
shot boundary displayed in the browser interface. This image does
not represent the next shot as it still contains an overlap from the
earlier shot. The last icon image 22 has a frame number (5148)
correctly picked by the split ahead operation. This is achieved
using a smoothed intensity plot as shown in FIG. 5C. The transition
point is identified by following the gradient along the smoothed
intensity plot until the gradient direction changes or the gradient
becomes negligible.
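The gradient-following rule can be sketched in a few lines. The moving-average smoothing, the window radius, and the negligibility threshold `eps` are assumptions for illustration; the input is assumed to be one mean intensity value per frame.

```python
def smooth(values, radius=1):
    """Moving average; the window shrinks at the ends of the sequence."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius):i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def split_ahead(intensity, selected, radius=1, eps=0.5):
    """Walk forward from the selected frame to the end of the transition.

    Stops when the gradient of the smoothed plot becomes negligible
    (below eps) or changes direction, and returns that frame index.
    """
    s = smooth(intensity, radius)
    t = selected
    direction = 0
    while t + 1 < len(s):
        gradient = s[t + 1] - s[t]
        if abs(gradient) < eps:
            break  # plot has flattened out: transition completed
        sign = 1 if gradient > 0 else -1
        if direction and sign != direction:
            break  # gradient direction changed
        direction = sign
        t += 1
    return t


if __name__ == "__main__":
    # A fade from intensity 10 to 50; the user clicks inside the ramp.
    plot = [10, 10, 10, 20, 30, 40, 50, 50, 50, 50]
    print(split_ahead(plot, selected=4))
```

Any starting frame inside the ramp of the example leads to the same end point, which mirrors the statement that any point selected in the transition region produces a correct split.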

[0032] A merge operation is used to merge two adjoining shots into
one. There are two types of merge operations: a merge back and a
merge ahead. These operations are used for merging a shot with a
previous or next shot. The shot to be merged is specified by
selecting any frame within it. The image icon representing the
merged shot is deleted by this operation.
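The split and merge operations can be pictured as edits on the shot-list. The sketch below assumes the shot-list is a time-ordered list of inclusive `(start_frame, end_frame)` pairs and that the split frame becomes the first frame of the new shot; icon regeneration is omitted, and all function names are hypothetical.

```python
def find_shot(shots, frame):
    """Index of the shot whose inclusive frame range contains `frame`."""
    for i, (start, end) in enumerate(shots):
        if start <= frame <= end:
            return i
    raise ValueError(f"frame {frame} is not covered by the shot-list")


def split(shots, frame):
    """Split the shot containing `frame` so that a new shot starts there."""
    i = find_shot(shots, frame)
    start, end = shots[i]
    if frame == start:
        return list(shots)  # already a shot boundary
    return shots[:i] + [(start, frame - 1), (frame, end)] + shots[i + 1:]


def merge_back(shots, frame):
    """Merge the shot containing `frame` with the previous shot."""
    i = find_shot(shots, frame)
    if i == 0:
        return list(shots)
    return shots[:i - 1] + [(shots[i - 1][0], shots[i][1])] + shots[i + 1:]


def merge_ahead(shots, frame):
    """Merge the shot containing `frame` with the next shot."""
    i = find_shot(shots, frame)
    if i == len(shots) - 1:
        return list(shots)
    return shots[:i] + [(shots[i][0], shots[i + 1][1])] + shots[i + 2:]


if __name__ == "__main__":
    shots = [(0, 99), (100, 199), (200, 299)]
    print(split(shots, 150))
    print(merge_back(shots, 150))
    print(merge_ahead(shots, 150))
```

Each function returns a new list and leaves its input unchanged, which makes it straightforward to keep the original automatically generated shot-list alongside the user-modified one.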
[0033] A play video operation allows the actual video to be played
on the video player interface from any selected video frame. Video
playback may be needed to determine the content of the shots and
detect subtle shot boundaries. While the video is playing, the
browser interface may track the video to keep the frame currently
playing at the center of the viewing area.

[0034] The browser interface can store a modified shot-list
containing the changes made by the user. The user can also trigger
the automatic clustering of shots in the shot-list from the browser
interface, to produce a merge-list which is used by the tree view
interface, as will be explained further on.

[0035] Once the shot boundaries have been corrected with the browser
interface, the automatic shot grouping method groups the shots into
stories, sub-plots, and further sub-plots, which reflect the
structure present in the video. Organization of shots into a higher
level structure is more complex since the semantics of the video has
to be inferred from the shots detected in the video. The automatic
shot grouping method determines the similarity between the shots and
the relevance of the repetition of similar shots, by comparing their
representative frame image icons generated during shot boundary
detection. This involves determining whether two images are similar.
The automatic shot grouping method uses a color method to cluster
the shots into initial groups, and then uses a method which uses
edge information within each group to refine the clustering or
grouping results.

[0036] The color method used for clustering shots into initial
groups is based on a name-based color description system. A suitable
name-based color description system is described by K.L. Kelly et
al., The ISCC-NBS Method of Designating Colors and A Dictionary of
Color Names, National Bureau of Standards Circular 553, November 1,
1955. The Kelly et al. ISCC-NBS color system is incorporated herein
by reference. The ISCC-NBS color system described by Kelly et al.
divides Munsell color space into irregularly shaped regions and
assigns a color name to each region based on human perception of
color and common usage. The ISCC-NBS system allows color histograms
to be used for automatic shot grouping without concern for how color
space is quantized and, in modified form, allows a color description
to be constructed independent of the illumination of a scene. Since
the color names are based on common usage, the results are more
likely to agree with a user's perception of color similarity. The
other advantage of using a name-based system is that it allows
development of user interfaces using natural language descriptions
of color.

[0037] Each color name in the NBS system has two components: a hue
name and a hue modifier. FIG. 6A shows a list of hue names used in
the NBS system, and FIG. 6B shows the hue modifiers used, e.g. "very
deep purplish blue" is a possible color name. However, not all
combinations of hue name and modifier are valid names. There are a
total of 267 valid color names, obtained by dividing Munsell color
space into irregularly shaped regions. The conversion from the
Munsell color space to the color names is described in exhaustive
tables and is purely based on observations, as no conversion
formulae are available.

[0038] A modified ISCC-NBS system is used in the present invention
to maintain the description of an image independent of its
illumination. In the modified NBS system, only the hue names are
used instead of the full color names. This modification
substantially improves the accuracy of clustering, as two similar
images are more likely to be clustered in the same group. In
addition, white and black are used to describe certain colors
instead of the actual hue names. Without this modification,
unexpected classifications of color have been observed. For example,
the use of the color name "green" with the modifier "very pale"
results in "very pale green", which is actually closer to white than
green. Similarly, "very dark green" is closer to black than green.
An "indeterminate" hue label is also used for colors with the
modifier "grayish". The number of colors is also reduced to 14, by
merging some of the colors into their more dominant component. For
example, "reddish orange" is considered to be "orange".
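
The reduction rules described above can be sketched as a small lookup. This is only an illustration and not the tables of the modified system: the function name `reduce_color_name`, the exact modifier strings, and the rule of taking the last word of a hue name as its dominant component are assumptions made for the example.

```python
def reduce_color_name(modifier: str, hue: str) -> str:
    """Map a full (modifier, hue) color name to a reduced label.

    Illustrative only: the modifier strings and the "last word is the
    dominant component" rule are assumptions, not the actual tables.
    """
    modifier = modifier.strip().lower()
    hue = hue.strip().lower()
    if modifier == "very pale":
        return "white"          # e.g. "very pale green" is closer to white
    if modifier == "very dark":
        return "black"          # e.g. "very dark green" is closer to black
    if "grayish" in modifier:
        return "indeterminate"  # grayish colors get no definite hue label
    # Merge compound hue names into their dominant component,
    # e.g. "reddish orange" -> "orange".
    return hue.split()[-1]
```

With these assumptions, "vivid reddish orange" reduces to "orange", and every shade such as "olive green" or "yellowish green" reduces to "green".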

[0039] FIGS. 7A and 7B are histograms obtained from two images of a
soccer match which demonstrate how reducing the number of colors
increases the likelihood of two similar images being clustered in
the same group. The grass color of the soccer field varies in the
shade or type of green along the field. Consequently, the color of
the soccer field in each image is of a different shade or type. The
histograms of FIG. 7A were generated using the modified list of 14
colors, while the histograms of FIG. 7B were generated using all the
standard hue names. Using the modified list of 14 colors, all types
of green are labeled "green" (color number 2); therefore, the
histograms of FIG. 7A are very similar. However, when all hue names
are used, the green label is divided into "olive green", "yellowish
green", "bluish green", etc. The histograms of FIG. 7B appear
different since the proportions of the different shades of green are
not the same in the two images.

[0040] FIG. 8 shows images labeled with the 14 modified colors. The
top row 24 depicts the original images and the lower row 26 depicts
the images labeled with the 14 modified colors 28.

[0041] After labeling the images with the 14 modified colors,
normalized histogram bin counts are used as feature vectors to
describe the color content of an image. FIG. 9A shows the color
histograms 30 of anchorperson shots grouped together based on their
similar color distributions and FIG. 9B shows the color histograms
32 of soccer field shots grouped together based on their similar
color distributions.
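
The feature vector described above can be sketched as follows. The sketch assumes each pixel has already been labeled with an integer color index from 0 to 13; the function name `color_feature_vector` and that integer encoding are assumptions made for illustration.

```python
def color_feature_vector(labels, n_colors=14):
    """Return normalized histogram bin counts for a list of pixel labels.

    `labels` holds one integer color index (0 .. n_colors-1) per pixel;
    this encoding is an assumption made for the example.
    """
    counts = [0] * n_colors
    for label in labels:
        counts[label] += 1
    total = len(labels)
    if total == 0:
        return [0.0] * n_colors
    # Dividing by the pixel count makes the vector independent of image size.
    return [count / total for count in counts]
```

Because the bins sum to one, two images of different sizes can be compared directly.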
[0042] Once the shots are clustered into initial groups, edge
information is used as a filter to remove shots incorrectly grouped
together based on color. This may occur because of the limited
number of colors used and the tolerances allowed when matching
histograms. Thus, visually dissimilar images with similar color
distributions may be grouped together.
[0043] Filtering is accomplished by classifying each edge pixel to
one of four cardinal directions based on the sign and relative
magnitude of the pixel's response to edge operators along x and y
directions. The histogram showing pixel counts along each of the
four directions is used as a feature vector to describe the edge
information in the image. This gives gross edge information in the
image when the image is simple. The image is simplified by
quantizing it to a few levels (4 or 8), using a quantizer, and
converting the quantized image to an intensity image. Image
quantizing using color quantizers is discussed in an article by X.
Wu, Color Quantizer v. 2, Graphics Gems, Vol. II, pp. 128-133. This
information is sufficient to filter out substantially all of the
false shots in a group.
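
A minimal sketch of the four-direction edge histogram follows. It assumes the x and y edge-operator responses have already been computed for each pixel (for example with a Sobel-style operator, which the text does not specify); the function name, the bin order, and the `min_strength` cutoff for deciding what counts as an edge pixel are assumptions.

```python
def edge_direction_histogram(gradients, min_strength=1.0):
    """Count edge pixels in each of four cardinal directions.

    `gradients` is an iterable of (gx, gy) edge-operator responses, one
    per pixel. Bin order is [+x, -x, +y, -y]. The cutoff `min_strength`
    is an assumed way of selecting edge pixels.
    """
    hist = [0, 0, 0, 0]
    for gx, gy in gradients:
        if abs(gx) + abs(gy) < min_strength:
            continue  # response too weak to be treated as an edge pixel
        if abs(gx) >= abs(gy):
            # The x response dominates; its sign picks the direction.
            hist[0 if gx >= 0 else 1] += 1
        else:
            # The y response dominates; its sign picks the direction.
            hist[2 if gy >= 0 else 3] += 1
    return hist
```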

[0044] FIGS. 10A-10C demonstrate quantization. FIG. 10A depicts an
original image prior to quantizing, FIG. 10B depicts the image after
it has been quantized, and FIG. 10C depicts the edge pixels of the
image.
[0045] The choice of clustering strategy is limited by having no a
priori knowledge of the number of clusters or assumptions about the
nature of the clusters. It cannot be assumed that similar images
will be temporally close to each other in the video, since the
repeating shots are likely to be scattered throughout the video.
Therefore, prior art clustering strategies which involve comparisons
among all possible elements in a limited window are not suitable.
The number of potential clusters is not known a priori, so known
K-means clustering and other strategies using this a priori
information are also not useful.

[0046] Moreover, it would be advantageous if the clustering strategy
is not off-line, i.e., does not require all the shots to be present
before starting. This allows the shots to be processed as they are
generated.
[0047] The preferred shot grouping method is based on nearest
neighbor classification, combined with a threshold criterion. This
method satisfies the constraints discussed above, where no a priori
knowledge or model is used. The initial clusters are generated based
on the color feature vector of the shots. Each initial cluster is
specified by a feature vector which is the mean of the color feature
vectors of its members. When a new shot is available, the city block
distance between its color feature vector and the means or feature
vectors of the existing clusters is computed. The new shot is
grouped into the cluster with the minimum distance from its feature
vector, provided the minimum distance is less than a threshold. If
an existing cluster is found for the new shot, the mean (feature
vector) of the cluster is updated to include the feature vector of
the new shot. Otherwise, a new cluster is created with the feature
vector of the new shot as its mean. The threshold is selected based
on the percentage of the image pixels that need to match in color in
order to call two images similar.
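
The grouping rule described above can be sketched as an on-line loop. This is a minimal illustration under stated assumptions: the names `city_block` and `cluster_shots` are hypothetical, shots arrive as already-computed color feature vectors, and the running mean is updated incrementally.

```python
def city_block(a, b):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_shots(feature_vectors, threshold):
    """Assign each shot to the nearest cluster mean, or start a new cluster.

    Returns (assignments, means), where assignments[i] is the cluster
    index of shot i. Shots are processed one at a time, in order.
    """
    means, sizes, assignments = [], [], []
    for vec in feature_vectors:
        best, best_dist = None, None
        for idx, mean in enumerate(means):
            dist = city_block(vec, mean)
            if best_dist is None or dist < best_dist:
                best, best_dist = idx, dist
        if best is not None and best_dist < threshold:
            # Fold the new vector into the running mean of the cluster.
            n = sizes[best]
            means[best] = [(m * n + x) / (n + 1)
                           for m, x in zip(means[best], vec)]
            sizes[best] = n + 1
        else:
            # No cluster is close enough: the shot seeds a new cluster.
            means.append(list(vec))
            sizes.append(1)
            best = len(means) - 1
        assignments.append(best)
    return assignments, means
```

For normalized histograms, the city block distance d and the fraction p of pixels that match in color are related by p = 1 - d/2, so one plausible way to pick the threshold from a required matching fraction is 2(1 - p); the text does not state the exact rule used.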
[0048] During post-processing of the color-based generated initial
clusters, shots are deleted from the cluster if the distance of
their edge feature vector from the mean edge vector of the group is
greater than a threshold, starting with the shot furthest from the
mean edge vector. The mean edge vector is recomputed each time a
member is deleted from the cluster. This is continued until all the
edge feature vectors of the members in the cluster are within the
threshold from the mean edge vector of the cluster, or there is a
single member left in the cluster. The threshold is a multiple of
the variance of the edge vectors of the cluster members.
Consequently, the final clusters are based on color as well as edge
similarity, allowing the color feature to be the main criterion in
determining the clusters.
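
The edge-based post-processing can be sketched as below. Two simplifications are assumptions of this sketch: the distance between an edge vector and the mean is taken to be the city block distance, and the threshold is passed in as a fixed number rather than derived from the variance of the cluster members. The function names are hypothetical.

```python
def city_block(a, b):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def prune_cluster(edge_vectors, threshold):
    """Remove members whose edge vector is too far from the cluster mean.

    Returns (kept, removed) as lists of member indices. The furthest
    member is removed first and the mean is recomputed after each
    removal, until all members are within the threshold or one is left.
    """
    members = list(range(len(edge_vectors)))
    removed = []
    while len(members) > 1:
        mean = mean_vector([edge_vectors[i] for i in members])
        dist, worst = max((city_block(edge_vectors[i], mean), i)
                          for i in members)
        if dist <= threshold:
            break  # every remaining member is within the threshold
        members.remove(worst)
        removed.append(worst)
    return members, removed
```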
[0049] A merge-list is produced by the automatic clustering which
identifies a group number for each shot in the shot-list. Other
features may also be used to produce the clusters, including audio
similarity. The merge-list is used for automatically constructing
the tree structure or VTOC.

[0050] FIG. 11 is a flow chart setting forth the steps performed by
the shot grouping method described above. In step A, the method
commences with a single cluster containing a first shot. In step B,
the color feature vector, C, of a new shot is obtained. In step C,
the nearest match between C and the means of existing clusters is
found. In step D, if the nearest match is less than the color
threshold, then in step E, the shot is added to the cluster
producing the nearest match and the cluster mean is updated after
including the new shot vector. If the nearest match is not less than
the color threshold, then in step F a new cluster with C as its mean
is created and the new shot is added to it. Then, from either step E
or step F, it is determined in step G whether more shots exist. If
so, the method starts over at step B. If no more shots exist, then
in step H, all clusters with more than one member are marked as
unchecked; these clusters are checked using edge information in the
following steps. In step I, an unchecked cluster with more than one
member is found; then the edge feature vectors, E, for each member
are computed, and the mean edge feature vector M for the cluster is
also found. In step J, the member of the cluster which gives a
maximum absolute value for (M - E) is obtained. In step K, it is
checked whether the maximum absolute value for (M - E) is greater
than the edge threshold. If the test in step K is true, then in step
L the member is deleted from the cluster and placed in a new cluster
and the cluster mean is recomputed and, if step M shows that the
cluster still contains more than one member, the method returns to
step J, else it goes to step N. If the test in step K is false, then
the method goes to step N, where the cluster is marked as checked.
In step O, it is tested whether there are more unchecked clusters.
If yes, the method goes to step I; otherwise, in step P a merge-list
is written out which specifies a group number for each shot in the
shot-list.

[0051] The merge-list generated by the automatic shot grouping
method is used for generating the tree structure. The tree structure
captures the organization of the video and is easy for users to
understand and work with. In the tree structure, a whole video is a
root node that can have a number of child nodes, each corresponding
to a separate "story" in the video. Each story node can have further
children nodes corresponding to sub-plots in the story, and the
sub-plots may be further sub-divided and so on. A story is a
self-contained unit which deals with a single or related subject(s).
Sub-plots are different elements in a story unit or sub-plot unit.
The tree structure has different types of nodes, the type of node
providing semantic information about its contents. Each node also
has a representative icon, allowing browsing without having to
unravel the full structure. Each new story starts with a story node
(main branch node) consisting of sub-plot nodes (secondary branch
nodes) for each sub-plot. Similar nodes are used to bind together
all consecutive frames found to be in the same group by the
automatic shot grouping method. Frequently, such a node may be
replaced by any one of its members by merging the other shots. Leaf
nodes are the final nodes on the main and secondary branch nodes.
The leaf nodes contain the shots from the shot-list.

[0052] The tree view interface allows a user to view and modify the
shot groups generated by the automatic shot grouping method. FIG. 12
depicts a group structure 34 displayed by the tree view interface.
At this point of video organizing, there are only two types of nodes
36, 40 attached to the root node 38. If the group contains a single
member, the member shot is attached as a leaf node to the root. For
groups containing more than one member, an intermediate group node
is attached, which contains the member shots as its children. The
tree view interface allows a user to move shots out of groups, move
shots into existing groups, or create new groups using operations
which will be explained further on. A modified merge-list can also
be generated which reflects the changes made by the user. Since the
tree structure is constructed from the merge-list, the shot groups
must be modified before the tree structure is loaded.

[0053] After correcting the results of the automatic shot grouping
method, the tree structure can now be automatically generated. The
preferred method used for generating the tree structure contains two
major functions. One of these functions is referred to as the
"create-VTOC-from-merged-list" function and the other is referred to
as the "find-structure" function. The create-VTOC-from-merged-list
function uses a method which finds all the story units, creates a
story node for each story unit, and initiates the find-structure
function to find structure within each story. In the
create-VTOC-from-merged-list function used in the present invention,
each story unit extends to the last re-occurrence of a shot which
occurs within the body of the story. The find-structure function
takes a segment of shot indices and traverses through the segment to
create a node for each shot until it finds one shot that reoccurs
later. At this point the find-structure function divides the rest of
the segment into sub-segments, each of which is led by the recurring
shot as a subplot node, and recursively calls itself to process each
sub-segment. If consecutive shots are found to be similar, they are
grouped under a single similar node. The structure produced by the
find-structure function is attached as a child of the story node for
which it was called.

[0054] FIGS. 13A and 13B are flow charts detailing the steps
described above. FIG. 13A depicts the create-VTOC-from-merged-list
portion of the method. In step 50, a pseudo or root node at the
first level of the hierarchical structure is created to represent
the entire video, shot index M is defined and assigned the value of
1, and the current level of the hierarchical structure, designated
ILEVEL, is assigned a value of 2. The story units are found in the
shot groups using shot indices L and E in step 52. Since each story
unit extends to a last re-occurrence of a shot occurring in the
story unit, index L is set to index M, and index E is set to the
shot index of the last occurrence of shot L in the video. If L is
less than or equal to E in step 54, the method determines whether
the shot index of the last occurrence of shot L in the video is
greater than E in step 56. If it is, E is set to the shot index of
the last occurrence of shot L in the video in step 58. L is then set
to L+1 in step 60, and the method returns to step 54 and processes
L+1. However, if L is not less than or equal to E in step 54, then
the method determines if a story has been found in step 62 by
determining whether index M is greater than or less than index E. If
index M is greater than or less than index E, then a story node at
level ILEVEL is created in step 64 to represent a found story, and
the find-structure portion of the method (described below) is called
up to find the structure within each story by processing index M, E,
and ILEVEL+1. If no story is found in steps 52, 54 and 62, then no
node is created and the find-structure portion of the method is
executed to find structure in any existing stories by processing
index M, E, and ILEVEL. After steps 64 or 66, index M is set to E+1
in step 68 and then it is determined if there are additional shots
to process in step 70. If additional shots are present, step 52 is
executed; if no more shots need processing, then the method stops at
step 72.
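
The story-unit search of FIG. 13A can be sketched as follows. The sketch assumes the merge-list is given as a Python list holding one group number per shot and uses 0-based shot indices (the flow chart counts from 1); the function name `find_story_units` is hypothetical.

```python
def find_story_units(merge_list):
    """Split a merge-list into story units.

    `merge_list[i]` is the group number of shot i. Each unit is returned
    as an inclusive (start, end) pair of 0-based shot indices and extends
    to the last re-occurrence of any shot that occurs inside it.
    """
    last_occurrence = {}
    for index, group in enumerate(merge_list):
        last_occurrence[group] = index  # later indices overwrite earlier ones

    units = []
    start = 0
    while start < len(merge_list):
        end = last_occurrence[merge_list[start]]
        cur = start
        while cur <= end:
            # A shot inside the unit may recur even later; extend the unit.
            end = max(end, last_occurrence[merge_list[cur]])
            cur += 1
        units.append((start, end))
        start = end + 1
    return units
```

A unit whose start and end coincide corresponds to the case in which no story is found, so no story node would be created for it.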

[0055] FIG. 13B depicts the find-structure portion of the method.
The find-structure portion of the method receives a segment of shot
indices (START-SHOT, END-SHOT, ILEVEL) and traverses through the
segment to create a node for each shot until it finds one shot that
reoccurs in the segment. The find-structure portion can be
determined by providing a variable S for the shots and setting S to
START-SHOT in step 74 and determining whether shot S is less than or
equal to the END-SHOT in step 76. If S is less than or equal to the
END-SHOT, then SET-OF-SHOTS is set to the list of shot indices of
all occurrences of shot S in ascending order in step 78. If S is not
less than or equal to the END-SHOT, the method ends at step 102.
Then in step 80, it is determined whether the SET-OF-SHOTS has only
one node (only one similar shot). If the SET-OF-SHOTS has only one
node (no reoccurring shot is found), a node at level ILEVEL
representing shot S is created in step 82 and S is set to S+1 in
step 84. If, however, the SET-OF-SHOTS has more than one node, one
or more reoccurring shots are in the segment. At this point the rest
of the segment is divided into sub-segments by setting S1 to the
first element in the SET-OF-SHOTS and setting S2 to the second
element in the SET-OF-SHOTS in step 86. The sub-segments are each
identified by one of the found reoccurring shots as a sub-plot node
by determining whether S1 and S2-1 are the same (whether consecutive
shots are similar) in step 88, and creating a subplot node at level
ILEVEL representing a group of shots in step 90 if S1 and S2-1 are
not the same. A node at level ILEVEL+1 representing shot S1 is also
created, and the method recursively calls itself to process S1+1,
S2-1, and ILEVEL+1. Then in step 92, S1 is set to S2, and S2 is set
to the element after S2 in the SET-OF-SHOTS. In step 94, it is
determined whether S2 is the last element in the SET-OF-SHOTS. If it
is not, then the method returns to step 88 and continues through the
steps 90, 92 and so forth. If S2 is the last element in the
SET-OF-SHOTS, then in step 98, it is determined whether S2 is the
END-SHOT. If S2 is not the END-SHOT, a subplot node at level ILEVEL
representing another group of shots is created, a node at level
ILEVEL+1 representing shot S2 is created, and the method recursively
processes S2+1, END-SHOT and ILEVEL+1. If S2 is the END-SHOT, then
in step 100, a node at level ILEVEL representing shot S2 is created.
At the conclusion of steps 98 or 100, the method ends at step 102.
Returning again to step 88, if consecutive shots are similar, a
similar node at level ILEVEL is created in step 104 and similar
shots are grouped together under the similar node in steps 106 and
108. Once similar shots are processed, step 94 is executed. The
structure produced by the find-structure portion of the method is
attached as a child node of the story node for which the
find-structure function was called.
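
A simplified reading of the find-structure recursion is sketched below. It is not a step-by-step transcription of FIG. 13B: the `Node` class, the `outline` helper, the use of 0-based indices, and the choice to look for occurrences of a shot only inside the current segment are assumptions of this sketch. Levels are implied by nesting rather than tracked through an explicit ILEVEL argument.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                       # "subplot", "similar", or "shot"
    shot: Optional[int] = None      # shot index, for "shot" nodes only
    children: List["Node"] = field(default_factory=list)


def find_structure(groups, start, end):
    """Return the list of nodes describing shots start..end (inclusive)."""
    nodes = []
    s = start
    while s <= end:
        occ = [i for i in range(s, end + 1) if groups[i] == groups[s]]
        if len(occ) == 1:
            nodes.append(Node("shot", shot=s))  # shot does not recur
            s += 1
            continue
        # Collect consecutive occurrences into runs of similar shots.
        runs = [[occ[0]]]
        for i in occ[1:]:
            if i == runs[-1][-1] + 1:
                runs[-1].append(i)
            else:
                runs.append([i])

        def lead_node(run):
            if len(run) == 1:
                return Node("shot", shot=run[0])
            return Node("similar",
                        children=[Node("shot", shot=i) for i in run])

        if len(runs) == 1:
            # Only adjacent repeats: one similar node, then keep scanning.
            nodes.append(lead_node(runs[0]))
            s = runs[0][-1] + 1
            continue
        # Each run leads a sub-segment reaching to the next run (or to end).
        for k, run in enumerate(runs):
            body_end = runs[k + 1][0] - 1 if k + 1 < len(runs) else end
            if run[-1] < body_end:
                body = find_structure(groups, run[-1] + 1, body_end)
                nodes.append(Node("subplot",
                                  children=[lead_node(run)] + body))
            else:
                nodes.append(lead_node(run))  # recurring shot ends segment
        break  # the rest of the segment has been consumed
    return nodes


def outline(node):
    """Compact nested view of a node, convenient for inspection."""
    if node.kind == "shot":
        return node.shot
    return (node.kind, [outline(child) for child in node.children])
```

For the group sequence `[0, 1, 2, 0, 3, 0]`, shot 0 recurs at indices 3 and 5, so the sketch yields two subplot nodes led by shots 0 and 3, followed by a plain node for shot 5.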
[0056] Although the tree view interface allows a user to view and
modify the shot groups generated by the automatic shot grouping
method, its primary function is to enable the user to view and
modify the tree structure description of the video generated from
the merge-list by the tree structure generating method. FIG. 14 is a
tree structure displayed by the tree view interface. Note that each
node has a representative icon, allowing browsing without having to
unravel the full structure. The video is represented by a root node
42 and each story is represented with a story or main branch node
44. The subplots in each story are represented with a subplot or
secondary branch node 46. Leaf nodes 48 contain the shots from the
shot-list. The tree view interface gives the user full freedom in
restructuring the tree structure to produce a meaningful video
organization. Often, semantic information can be missed or
misinterpreted by the method which automatically generates the tree
structure. The tree view interface includes operations for moving,
adding, deleting, and updating nodes. These operations facilitate
changes in the tree structure. These operations are also provided
when the tree view interface is used for editing the shot groups.
[0057] The node moving operation of the tree view interface allows a
user to move nodes either one at a time or in groups. Node moving is
a two-step process involving selecting one or more nodes to be moved
and selecting a destination node. The moved node(s) are added as
siblings of the selected destination node, either before or after
(default choice) the destination node.
[0058] The add node operation allows a user to add new leaf nodes
only through changes in the shot-list using the browser interface.
However, all types of non-leaf nodes can be added to the tree. To
avoid the creation of empty nodes, an existing node has to be
selected to be a child of the new node created. A destination node
also needs to be selected to specify the position where the new node
is to be attached.
[0059] The delete node operation is an automatic operation. Leaf
nodes can only be deleted through changes in the shot-list using the
browser interface. Nodes with children cannot be deleted. All other
(non-leaf) nodes are deleted automatically when they have no
children.
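
The move and automatic delete behavior described above can be sketched with a small tree class. The class `TreeNode` and the functions `move_node` and `prune_empty` are hypothetical names introduced for illustration, and the sketch assumes the destination node is not the root.

```python
class TreeNode:
    """Minimal tree node with a parent pointer (illustrative only)."""

    def __init__(self, name, is_leaf=False):
        self.name = name
        self.is_leaf = is_leaf
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def prune_empty(node):
    """Delete non-leaf nodes left without children, walking upward."""
    while (node is not None and node.parent is not None
           and not node.is_leaf and not node.children):
        parent = node.parent
        parent.children.remove(node)
        node.parent = None
        node = parent


def move_node(node, dest, before=False):
    """Move `node` to become a sibling of `dest` (after it by default)."""
    old_parent = node.parent
    old_parent.children.remove(node)
    siblings = dest.parent.children
    position = siblings.index(dest) + (0 if before else 1)
    siblings.insert(position, node)
    node.parent = dest.parent
    # A group emptied by the move is removed automatically.
    prune_empty(old_parent)
```

Moving the only shot out of a group leaves that group without children, so it is removed without an explicit delete command.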

[0060] The update operation uses cues from the user, e.g., when the
user moves a shot (node) from one group to another group, to further
reduce the effort which is needed to modify the automatic shot
grouping results. The update operation first searches for whether
image portions of two shots being merged by the user are similar.
For example, in a news broadcast, each of the two shots may have an
anchor person sitting at a desk with a TV in the background.
However, the TV image portion of each shot may be different, thereby
indicating that the subject matter of the groups was not similar.
Accordingly, partial match templates are then generated to block the
TV image so that the system can look in all the remaining groups for
nodes (shots) having an anchor person sitting at a desk with a TV in
the background (TV image is blocked). Shots (nodes) found in the
remaining groups with the anchor person/TV background image are then
automatically moved to the group where the shot moved by the user
was placed.
[0061] If the update operation cannot find any similar portions in
the two merged shots, it will compare the audio streams of the
merged shots to determine if both are generated by the same speaker
(person). For example, the news stories in a particular news program
may not always start with an anchor person shot; thus, two different
shot groups may have been generated by the automatic shot grouping
method. In this scenario, audio streams of shots in all the
remaining groups will be compared to see if they were produced by
the same speaker. Shots (nodes) found in the remaining groups having
audio streams produced by the same speaker are then automatically
moved to the group where the shot moved by the user was placed.

[0062] Finally, if the two shots merged by the user have completely
different visual and audio features, the update operation will
repeat the previous operation on all the siblings of the shot (node)
which was selected for the previous operation. For example, if a
sub-plot node is deleted by moving all its members to another
sub-plot node, just one member needs to be explicitly moved. The
other members will be moved with the update operation.
[0063] The user can invoke these operations to regroup the shots
into more meaningful stories and sub-plots. The order of shots can
also be changed from their usual temporal order to a more logical
sequence. When used along with the browser, all possible changes to
the content and organization of the tree are supported.
[0064] The tree structure is stored as a tree-list file so that
organized videos can display the tree structure without executing
the processing steps again. Modifications made by the user in the
tree structure are also saved in the tree-list.

[0065] As mentioned earlier, any one of the graphic user interfaces
can be started first, provided the required files are present. A
shot-list is needed to start the browser interface. The tree view
interface starts with the group structure only if a merge-list is
present; otherwise it starts with a tree structure stored in a
tree-list.

[0066] There are a number of specific interactions involving the
browser interface. The browser interface can produce a change in the
shot-list. This information is provided to the tree view interface
via a message and the change becomes visible immediately, i.e., a
new shot appears at the specified location or a shot gets deleted
automatically. This helps the user to actually see the icons
representing the shots that are being created or deleted. The visual
information from the tree can be used to determine actions taken in
the browser interface. For example, when two consecutive
representative icons depicted by the tree view interface cover a
very similar subject matter, the user may choose to merge them even
though a shot change is visible from the display of the browser
interface. The user may also opt to reload the tree view interface
using the new shot-list to edit the clustering, when there have been
enough changes in the shot-list to make an earlier tree structure
obsolete.

[0067] The tree view interface is associated with a number of
interactions also. When changes are made in the order of the shots
and the user wants to see these changes reflected in the browser
interface, the user can opt to send a signal to the browser
interface to reload the rearranged shot-list after saving it from
tree view. Moreover, the video player interface can be played from
the tree interface exactly as in the browser interface.
[0068] Since the browser and tree view interfaces work with higher
level representations of the video, the video player interface
allows a user to view a video and its audio from any point in the
video. The video player interface has the functionality of a VCR
including fast forward, rewind, pause and step. FIG. 15 depicts a
video displayed by the video interface.
[0069] It is understood that the above-described embodiments
illustrate only a few of the many possible specific embodiments
which can represent applications of the principles of the invention.
Hence, numerous modifications and changes can be made by those
skilled in the art without departing from the spirit and scope of
the invention.



Claims 

1. A system for interactively organizing and browsing raw video to
facilitate browsing of video archives, comprising:

    automatic video organizing means for automatically organizing a
    raw video into a hierarchical structure that depicts the video's
    organized contents; and

    user interface means for allowing a user to view and manually
    edit the hierarchical structure.

2. The system according to claim 1, wherein the automatic organizing
means includes shot detecting means for automatically detecting
abrupt scene changes in raw frames of the video and automatically
organizing the raw frames into a list of shots, each of the shots
representing a continuous action in time and space.

3. The system according to claim 2, wherein the user interface means
includes browser interface means for allowing the user to view the
shots, add new shots to the list of shots, and merge the shots into
a single shot.

4. The system according to claim 1, wherein the automatic organizing
means includes shot grouping means for automatically grouping shots,
which represent a continuous action in time and space, into groups
of visually similar shots, each group of shots capturing a given
structure in the raw video.

5. The system according to claim 4, wherein the user interface means
includes tree view interface means for allowing the user to view the
groups of visually similar shots, create new groups of visually
similar shots, and modify the groups of visually similar shots.

6. The system according to claim 5, wherein the tree view interface
means includes update means for determining whether any image
portions of a shot merged with another shot by the user are similar,
wherein when said update means finds similar image portions, said
update means generates partial match templates which block
dissimilar image portions of remaining shots and automatically
merges the remaining shots that have similar image portions, wherein
when said update means does not find any similar image portions,
said update means determines whether audio streams of the two merged
shots are similar, wherein when said update means finds similar
audio streams, said update means searches other shots in the groups
for similar audio streams and automatically merges other shots with
similar audio streams together, wherein when said update means does
not find any similar audio streams in the two merged shots, the
update means repeats the user's action on all siblings of the merged
shot.

7. The system according to claim 1, wherein the automatic organizing
means includes hierarchical structure generating means for creating
the hierarchical structure from groups of visually similar shots,
each of the shots representing a continuous action in time and
space.

8. The system according to claim 7, wherein the user interface means
includes tree view interface means for allowing the user to view and
modify the hierarchical structure to make the hierarchical structure
substantially useful and meaningful to the user.

9. The system according to claim 1, wherein the user interface means
includes video player means for allowing the user to play the video
along with the video's audio from any point in the video.

10. A method used in automatically organizing video for
automatically grouping shots into groups of visually similar shots,
each group of shots capturing structure in a raw video, the shots
generated by detecting abrupt scene changes in raw frames of the
video which represent a continuous action in time and space, the
method comprising the steps of:

    providing a predetermined list of color names;

    describing image colors in each of the shots using the
    predetermined list of color names; and

    clustering the shots into visually similar groups based on the
    image colors described in each of the shots.

11. The method according to claim 10, further comprising the step of
using image edge information from each of the shots to identify and
remove incorrectly clustered shots from the groups.

12. The method according to claim 10, wherein the list of color
names includes a plurality of hue names.

13. The method according to claim 10, wherein the step of describing
includes the step of obtaining a color histogram for each shot based
on the predetermined list of color names.

14. The method according to claim 13, wherein the step of describing
further includes the step of normalizing bin counts of the color
histograms to provide a feature vector which describes the image
colors of the shots.

15. The method according to claim 10, wherein the step of clustering
includes the steps of:

    providing a single one of the groups containing a first shot;

    getting a color feature vector of a new shot, the color feature
    vector based on the predetermined list of color names;

    finding a nearest match between the vector and group means of
    existing groups; and

    determining if the nearest match is less than a predetermined
    color threshold.

16. The method according to claim 15, wherein the step of clustering
further includes the steps of:

    adding the new shot to a group producing the nearest match if
    the nearest match is less than the color threshold; and

    updating means of the group producing the nearest match.

17. The method according to claim 15, wherein the step of clustering
further includes the steps of:

    creating a new group with the color feature vector as its mean
    if the nearest match is not less than the predetermined color
    threshold; and

    adding the shot to the new group.

18. The method according to claim 11, wherein the step of using
image edge information includes the steps of:

    computing edge feature vectors E for each shot of groups having
    more than one shot; and

    computing mean edge feature vector M for each group having more
    than one shot.

19. The method according to claim 18, wherein the step
of using image edge information further includes the
steps of:

finding a shot of a group which gives a maximum
absolute value for (M-E); and
determining if the maximum absolute value for
(M-E) is greater than a predetermined edge
threshold.

20. The method according to claim 19, wherein the step
of using image edge information further includes the
steps of:

deleting the shot of the group if the maximum
absolute value for (M-E) is greater than the predetermined
threshold and placing the shot in a
new group; and

recomputing the mean edge feature vector of
the group with the removed shot.

21. The method according to claim 20, further comprising
the step of writing a merge-list which specifies
a group number for each of the shots.
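
The edge-based check of claims 18 to 21 can be sketched as below. Two points are assumptions rather than statements from the claims: the "absolute value for (M-E)" is taken as the largest absolute component difference, and the check is repeated on a group until no member exceeds the threshold. The names `refine_groups` and `merge_list` are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def deviation(mean, vec):
    # Assumption: |M - E| is read as the largest absolute component difference.
    return max(abs(m - e) for m, e in zip(mean, vec))


def refine_groups(groups, edge_vectors, edge_threshold):
    """Move shots whose edge vector E deviates too far from the group mean M."""
    refined = []
    for group in groups:
        group = list(group)
        removed = []
        while len(group) > 1:
            mean = mean_vector([edge_vectors[s] for s in group])
            worst = max(group, key=lambda s: deviation(mean, edge_vectors[s]))
            if deviation(mean, edge_vectors[worst]) <= edge_threshold:
                break
            group.remove(worst)        # delete the shot from the group
            removed.append([worst])    # and place it in a new group
        refined.append(group)
        refined.extend(removed)
    return refined


def merge_list(groups, num_shots):
    """Return the group number of every shot, indexed by shot number."""
    labels = [None] * num_shots
    for group_id, group in enumerate(groups):
        for shot in group:
            labels[shot] = group_id
    return labels
```

The mean is recomputed at the top of each pass, which corresponds to recomputing M after a shot is removed.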


22. A method for automatically organizing groups of visually
similar shots, which capture structure in a video,
into a hierarchical structure that depicts the video's
organized contents, the method comprising the
steps of:

(a) creating a root node at a first level of said
hierarchical structure which represents the video
in its entirety;

(b) finding story units in the groups of shots obtained
from said video, each of the story units
extending to a last re-occurrence of a shot
which occurs within the story unit;

(c) creating a story node for each of the story
units, the story nodes defining main branches
of the hierarchical structure;

(d) finding structure within each of the story
units; and

(e) attaching the structure as a child node to the
main branches, the child node defining the secondary
branches of the hierarchical structure.

23. The method according to claim 22, wherein step (b)
includes the step of searching through consecutive
similar shots for a last re-occurrence of a shot to
find each of the story units.
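
Step (b) of claim 22 can be illustrated with a small sketch that works on a merge-list, that is, a list giving the group number of each shot in temporal order. A story unit starts at the first unassigned shot and is extended to the last re-occurrence of any shot group seen inside it. The function name `find_story_units` is hypothetical, and the sketch is one reading of the claim rather than the definitive procedure.

```python
def find_story_units(merge_list):
    """Return inclusive (start, end) shot-index pairs, one per story unit."""
    # Index of the last occurrence of every group number.
    last = {group: index for index, group in enumerate(merge_list)}
    units = []
    start = 0
    while start < len(merge_list):
        end = last[merge_list[start]]
        i = start
        while i <= end:
            # A shot inside the unit that re-occurs later extends the unit.
            end = max(end, last[merge_list[i]])
            i += 1
        units.append((start, end))
        start = end + 1
    return units
```

With the merge-list `[0, 1, 0, 2, 1, 3, 4, 3, 5]`, the first unit runs from shot 0 to shot 4 because group 1 re-occurs at index 4, after the last occurrence of group 0.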

24. The method according to claim 23, wherein step (c)
further includes the step of grouping consecutive
shots which are similar under a corresponding story
node.

25. The method according to claim 22, wherein step (d)
includes the steps of:

traversing through a segment of consecutive
shots until a reoccurring shot is found; and
creating a node for each of the shots until a
reoccurring shot is found.

26. The method according to claim 25, wherein step (d)
further includes the steps of:

dividing a remaining portion of the segment of
consecutive shots into sub-segments; and
creating a sub-plot node for each of the sub-segments,
each of the sub-segments identified
by a corresponding reoccurring shot.

27. The method according to claim 26, wherein said
step (d) further includes the steps of:

finding consecutive shots which are similar;
creating a similar node; and
grouping consecutive shots which are similar
under the similar node.
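
The grouping of consecutive similar shots in claim 27 can be sketched as below, again on a merge-list of group numbers in temporal order. The tuple layout and the labels "similar" and "shot" are assumptions made for this illustration; `group_consecutive_similar` is a hypothetical name.

```python
from itertools import groupby


def group_consecutive_similar(merge_list):
    """Collapse runs of consecutive shots that share a group number."""
    nodes = []
    start = 0
    for _, run in groupby(merge_list):
        length = len(list(run))
        shots = list(range(start, start + length))
        # A run of two or more shots is placed under one "similar" node.
        kind = "similar" if length > 1 else "shot"
        nodes.append((kind, shots))
        start += length
    return nodes
```

For `[0, 0, 1, 2, 2, 2]` this yields a similar node for shots 0 and 1, a single shot node for shot 2, and a similar node for shots 3 to 5.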

28. A method for interactively organizing and browsing
video, the method comprising the steps of:

automatically organizing a raw video into a hierarchical
structure that depicts the video's organized
contents; and
viewing and manually editing the hierarchical
structure to make the hierarchical structure
substantially useful and meaningful to the user.

29. The method according to claim 28, wherein the step
of automatically organizing includes the steps of:

automatically detecting abrupt scene changes
in raw frames of the video; and automatically
organizing the raw frames into a list of shots,
each of the shots representing a continuous action
in time and space.
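
How detected scene changes turn raw frames into a list of shots can be sketched as follows. This is a deliberate simplification: it applies a fixed threshold to a precomputed inter-frame difference metric, which is not the nonstationary time series model mentioned in the abstract. The function name `shots_from_differences` and the threshold parameter are hypothetical.

```python
def shots_from_differences(differences, cut_threshold):
    """Split frames into shots.

    differences[i] is a difference metric between frame i and frame i + 1.
    Returns inclusive (first_frame, last_frame) pairs.
    """
    shots = []
    start = 0
    for i, value in enumerate(differences):
        if value > cut_threshold:
            # Abrupt change between frame i and frame i + 1: close the shot.
            shots.append((start, i))
            start = i + 1
    # The final shot runs to the last frame.
    shots.append((start, len(differences)))
    return shots
```

A fixed threshold is the simplest possible decision rule and is used here only to show the bookkeeping from cut positions to shot boundaries.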

30. The method according to claim 29, wherein the step
of viewing includes the steps of:

viewing the shots; and
manually editing the shots.

31. The method according to claim 28, wherein the step
of automatically organizing includes the steps of:

automatically grouping shots, which represent
a continuous action in time and space, into
groups of visually similar shots, each group of
shots capturing a given structure in the raw video.



32. The method according to claim 31, wherein the step
of viewing includes the steps of:

viewing the groups of visually similar shots; and
manually editing the groups of shots.

33. The method according to claim 32, wherein the step
of manually editing the groups of shots includes the
step of determining whether any image portions of
a first shot merged with a second shot by the user
are similar, wherein when similar image portions are
found, partial match templates are generated which
block dissimilar image portions of remaining shots
and automatically merges the remaining shots having
image portions which are similar to the similar
image portions of the merged first and second
shots, wherein when similar image portions in the
merged first and second shots are not found, determining
whether audio streams of the merged first
and second shots are similar, wherein when similar
audio streams are found in the merged first and second
shots, searching audio streams of the remaining
shots to determine if they are similar to the audio
streams of the merged first and second shots and
automatically merging the remaining shots, having
audio streams that are similar to the audio streams
of the merged first and second shots, with the
merged first and second shots, wherein when similar
audio streams are not found in the merged first
and second shots, the action taken by the user on
the first shot is automatically repeated on all siblings
of the first shot.

34. The method according to claim 28, wherein the step
of automatically organizing includes the step of generating
the hierarchical structure from groups of visually
similar shots, each of the shots representing
a continuous action in time and space.

35. The method according to claim 34, wherein the step
of viewing includes the steps of:

viewing the hierarchical structure; and
modifying the hierarchical structure to make the
hierarchical structure substantially useful and
meaningful to the user.

36. The method according to claim 28, wherein the step
of viewing includes the step of playing the video
along with the video's audio from any point in the
video.






FIG. 2A (PRIOR ART)
Histogram of a typical inter-frame difference image: pixel counts (0 to 250) plotted against intensity difference (-200 to 200), with a reference curve whose label is only partly legible.

FIG. 2B (PRIOR ART)
Histogram of a typical inter-frame difference image: pixel counts (0 to 250) plotted against intensity difference (-200 to 200), with the reference curve 250*exp(-0.5*abs(x)).



FIG. 3
Block diagram with two groups of blocks.
AUTOMATIC ALGORITHMS: CUT DETECTION; SHOT GROUPING; VIDEO TABLE OF CONTENTS CREATION.
INTERACTIVE INTERFACES: BROWSER (CORRECT SHOT BOUNDARIES); TREE VIEW (CORRECT GROUPING RESULTS); TREE VIEW (CORRECT VTOC); VIDEO PLAYER.
Reference numerals shown: 10, 12, 14.



FIG. 5A

FIG. 5B



FIG. 6A (PRIOR ART)
List of hue names: RED, GREEN, REDDISH PURPLE, REDDISH BROWN, REDDISH ORANGE, BLUISH GREEN, PURPLISH RED, BROWN, ORANGE, GREENISH BLUE, PURPLISH PINK, YELLOW GREEN, ORANGE YELLOW, BLUE, PINK, YELLOWISH BROWN, YELLOW, PURPLISH BLUE, YELLOWISH PINK, OLIVE BROWN, GREENISH YELLOW, VIOLET, BROWNISH PINK, OLIVE, YELLOWISH GREEN, PURPLE, BROWNISH ORANGE, OLIVE GREEN.



FIG. 6B (PRIOR ART)
Chart of color modifiers, with LIGHTNESS (MUNSELL VALUE) on one axis and SATURATION (MUNSELL CHROMA) on the other. Modifiers as grouped in the chart:
VERY PALE, VERY LIGHT, BRILLIANT
PALE, LIGHT GRAYISH, LIGHT
GRAYISH, MODERATE, STRONG, VIVID
DARK GRAYISH, DARK, DEEP
BLACKISH, VERY DARK, VERY DEEP



FIG. 7A
Plot against Color Number; the legend entries begin 22562 and 23009.

FIG. 7B (PRIOR ART)
Plot against Color Number; the legend entries begin 22562 and 23009.



FIG. 9A
Histogram of Anchor-person Shots, plotted against Color Number (0 to 12).

FIG. 9B
Histogram of Soccer-field Shots, plotted against Color Number (0 to 12); the legible legend entries begin 22577, 22582 and 22797.

FIG. 10A

FIG. 10B

FIG. 10C



FIG. 11
Flowchart of shot grouping. Legible boxes, in order:
START WITH A SINGLE CLUSTER CONTAINING THE FIRST SHOT
GET COLOR FEATURE VECTOR, C, OF NEW SHOT
FIND NEAREST MATCH BETWEEN C AND MEANS OF EXISTING CLUSTERS
YES branch: ADD SHOT TO THE CLUSTER PRODUCING NEAREST MATCH; UPDATE CLUSTER MEAN AFTER INCLUDING NEW SHOT VECTOR
Other branch: CREATE A NEW CLUSTER WITH C AS MEAN AND ADD NEW SHOT TO IT
Decision: MORE SHOTS EXIST?
MARK ALL CLUSTERS WITH MORE THAN ONE MEMBER AS UNCHECKED
For an unchecked cluster with more than one member: COMPUTE EDGE FEATURE VECTORS, E; COMPUTE MEAN EDGE FEATURE VECTOR, M
FIND THE MEMBER OF CLUSTER WHICH GIVES MAXIMUM |M-E|
DELETE MEMBER FROM CLUSTER AND PLACE IN NEW CLUSTER; RECOMPUTE CLUSTER MEAN
Decision: CLUSTER STILL CONTAINS MORE THAN ONE MEMBER?
Decision: MORE UNCHECKED CLUSTERS WITH MULTIPLE MEMBERS?
WRITE IN A MERGELIST WHICH SPECIFIES THE GROUP NUMBER FOR EACH SHOT IN THE SHOTLIST



FIG. 12



FIG. 13A
Flowchart CREATE-VTOC-FROM-MERGED-LIST. Legible boxes, in order:
CREATE A PSEUDO NODE AT LEVEL 1 TO REPRESENT THE ENTIRE VIDEO; ILEVEL=2
E = THE SHOT INDEX OF THE LAST OCCURRENCE OF SHOT L IN THE VIDEO
Decision: SHOT INDEX OF THE LAST OCCURRENCE OF SHOT L IN THE VIDEO GREATER THAN E?
E = THE SHOT INDEX OF THE LAST OCCURRENCE OF SHOT L IN THE VIDEO
CREATE A STORY NODE AT LEVEL ILEVEL TO REPRESENT THE STORY; FIND-STRUCTURE (..., ILEVEL+1)
FIND-STRUCTURE (N, E, ILEVEL)
Reference numerals shown: 50, 52, 58, 64, 66, 72.



FIG. 13B
Flowchart FIND-STRUCTURE (START-SHOT, END-SHOT, ILEVEL). Legible boxes, in order:
S = START-SHOT
SET-OF-SHOTS = THE LIST OF SHOT INDICES OF ALL OCCURRENCES OF SHOT S IN ASCENDING ORDER
CREATE A NODE AT LEVEL ILEVEL REPRESENTING SHOT S
S1 = THE 1st ELEMENT IN SET-OF-SHOTS; S2 = THE 2nd ELEMENT IN SET-OF-SHOTS
CREATE A SIMILAR NODE AT LEVEL ILEVEL
CREATE A SUBPLOT NODE AT LEVEL ILEVEL REPRESENTING A GROUP OF SHOTS; CREATE A NODE AT LEVEL ILEVEL+1 REPRESENTING SHOT S1; FIND-STRUCTURE (S1+1, S2-1, ILEVEL+1)
ADD S1 TO THE SIMILAR NODE; S1 = S2; S2 = THE ELEMENT AFTER S2 IN SET-OF-SHOTS
S1 = S2; S2 = THE ELEMENT AFTER S2 IN SET-OF-SHOTS
CREATE A NODE AT LEVEL ILEVEL REPRESENTING SHOT S2
CREATE A SUBPLOT NODE AT LEVEL ILEVEL REPRESENTING A GROUP OF SHOTS; CREATE A NODE AT LEVEL ILEVEL+1 REPRESENTING SHOT S2; FIND-STRUCTURE (S2+1, END-SHOT, ILEVEL+1)
Reference numerals shown: 92, 98, 102, 104.



FIG. 14

FIG. 15



